Kickstarter dataset project

Description

Reading and cleaning up the dataframe

The dataframe has 323750 rows

Let's put the two dataframes back together

We end up with a dataframe of 323746 rows out of the initial 323750. This is consistent, since we removed 4 rows that had empty project names. IDs are unique, which is good.
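As an illustration of this cleanup step, here is a minimal sketch on a toy frame. It assumes the columns are called `ID` and `name` and that empty names are stored as missing values; the real column names and encoding may differ.

```python
import pandas as pd

# Toy data standing in for the Kickstarter dump; column names are assumptions.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "name": ["Board game", None, "Short film", "Art book"],
})

# Drop rows whose project name is missing.
cleaned = toy.dropna(subset=["name"]).reset_index(drop=True)

# How many rows were removed, and are the remaining IDs unique?
removed = len(toy) - len(cleaned)
ids_unique = cleaned["ID"].is_unique
print(removed, ids_unique)
```

Comparing `removed` with the number of empty names is the same sanity check as the 323750 to 323746 comparison above.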

We can conclude that there are 3797 rows with both country and usd pledged set to Null.
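A count like this can be obtained by combining two null masks. The sketch below uses toy data and assumes the columns are named `country` and `usd pledged` and that "Null" means a missing value (NaN/None).

```python
import numpy as np
import pandas as pd

# Toy data; column names and the NaN encoding of "Null" are assumptions.
toy = pd.DataFrame({
    "country": ["US", None, "GB", None],
    "usd pledged": [100.0, np.nan, np.nan, 5.0],
})

# True only where both columns are missing on the same row.
both_null = toy["country"].isna() & toy["usd pledged"].isna()
n_both_null = int(both_null.sum())
print(n_both_null)
```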

Exploratory Data Analysis

Let's explore the dataset by building graphs to gain more insight into the features and characteristics of a successful project.

Successful and failed projects make up 87% of all projects.

Since we want to study successful and failed projects (the majority of them, as we can see), I am going to drop the rows where the state is anything else. We thus end up with a binary target variable that we will want to predict in our model.
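The filtering and the binary target can be sketched as follows. The column name `state`, the label values, and the new `target` column are assumptions made for this toy example.

```python
import pandas as pd

# Toy data; the column name "state" and its labels are assumptions.
toy = pd.DataFrame({
    "state": ["failed", "successful", "canceled", "failed", "live",
              "successful", "failed", "failed", "successful", "failed"],
})

keep = ["successful", "failed"]

# Share of projects that are either successful or failed.
share = toy["state"].isin(keep).mean()

# Keep only those rows and encode the target: 1 = successful, 0 = failed.
binary = toy[toy["state"].isin(keep)].copy()
binary["target"] = (binary["state"] == "successful").astype(int)

# Class proportions of the resulting binary target.
balance = binary["target"].value_counts(normalize=True)
print(share)
print(balance)
```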

A 60/40 distribution is fine, so we can consider the classes balanced.

Let's investigate whether the feature usd pledged is correctly derived from the feature pledged.

We can see here that 5% of the usd pledged values do not match pledged when the currency is USD. That is a lot of records, as the majority of our dataset has its currency set to USD (see figure below).
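One way to measure such a mismatch is to restrict the frame to USD rows and compare the two columns with a floating-point tolerance. This is a sketch on toy data with assumed column names (`currency`, `pledged`, `usd pledged`); it is not necessarily how the 5% figure was computed.

```python
import numpy as np
import pandas as pd

# Toy data; column names are assumptions.
toy = pd.DataFrame({
    "currency": ["USD", "USD", "GBP", "USD"],
    "pledged": [100.0, 250.0, 80.0, 0.0],
    "usd pledged": [100.0, 30.0, 120.0, 0.0],
})

# For USD campaigns, pledged and usd pledged should be identical.
usd_rows = toy[toy["currency"] == "USD"]

# np.isclose avoids false alarms from float rounding;
# a NaN on either side counts as a mismatch.
mismatch = ~np.isclose(usd_rows["pledged"], usd_rows["usd pledged"])
mismatch_share = mismatch.mean()
print(mismatch_share)
```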

Recalculate real_usd_pledged and usd_goal

There seem to be discrepancies in how the usd pledged feature was calculated. We choose to recalculate it from the pledged column, using a Python library to convert the amounts into USD based on the campaign end date (the deadline column) and the project currency. We will also add usd_goal so that all amounts are in the same currency.

The following is outdated: the library is no longer maintained. I'll load the final dataset for the following steps.

The code is left here for reference.

    from exchangeratesapi import Api

    api = Api()
    df_final_mf = df_final[(df_final["currency"]!="USD") & (df_final["pledged"]!=0.0)].copy()
    dico = api.get_rates(
        'USD',
        df_final_mf['currency'].unique().tolist(),
        start_date=df_final_mf['deadline'].min().strftime("%Y-%m-%d"),
        end_date=df_final_mf['deadline'].max().strftime("%Y-%m-%d"),
    )

    len(dico['rates'].keys())

    from datetime import datetime
    from datetime import timedelta

    def convert_rates2(amount, currency, PstngDate, dico_curr):
        keys = list(dico_curr['rates'].keys())
        if currency != 'USD' and amount != 0.0:
            try:
                return amount/dico_curr['rates'][PstngDate][currency]
            except KeyError:
                c_date = datetime.strptime(PstngDate, "%Y-%m-%d").date()
                previous_dates = [
                    datetime.strptime(date, "%Y-%m-%d").date()
                    for date in keys
                    if datetime.strptime(date, "%Y-%m-%d").date() < c_date
                ]
                previous_closest_date = max(previous_dates)
                print(PstngDate + " becomes " + previous_closest_date.strftime("%Y-%m-%d"))
                return amount/dico_curr['rates'][previous_closest_date.strftime("%Y-%m-%d")][currency]
        else:
            return amount

    def get_date_str(date):
        return date.strftime('%Y-%m-%d')

    %%time
    # TEST
    convert_rates2(1317.00, "SEK", '2014-12-14', dico)

    df_final['tmp_deadline'] = df_final.apply(lambda x: get_date_str(x['deadline']), axis=1)

    %%time
    df_final["real_usd_pledged"] = np.vectorize(convert_rates2)(
        amount=df_final['pledged'],
        currency=df_final['currency'],
        PstngDate=df_final['tmp_deadline'],
        dico_curr=dico
    )

    df_final_mf = df_final[(df_final["currency"]!="USD")].copy()

    dico_goal = api.get_rates(
        'USD',
        df_final_mf['currency'].unique().tolist(),
        start_date=df_final_mf['deadline'].min().strftime("%Y-%m-%d"),
        end_date=df_final_mf['deadline'].max().strftime("%Y-%m-%d"),
    )

    len(dico_goal['rates'].keys())

    %%time
    df_final["usd_goal"] = np.vectorize(convert_rates2)(
        amount=df_final['goal'],
        currency=df_final['currency'],
        PstngDate=df_final['tmp_deadline'],
        dico_curr=dico_goal
    )

    df_final = df_final.drop(columns=['tmp_deadline']).reset_index(drop=True)

    # Saving the final dataframe
    compression_opts = dict(method='zip', archive_name='kickstarter2.csv')
    df_final.to_csv('out2.zip', index=False, compression=compression_opts)

    df_final.describe()
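Since the library is no longer maintained, the core idea of `convert_rates2` (divide by the USD-based rate for the deadline, falling back to the closest earlier date when that day has no quote) can be sketched with the standard library only. The helper `make_usd_converter`, the `rates_by_date` layout, and the rates themselves are hypothetical and made up for illustration; any source of historical rates would have to be reshaped into this form.

```python
from bisect import bisect_right

def make_usd_converter(rates_by_date):
    """Build a converter from a table of USD-based rates.

    rates_by_date maps 'YYYY-MM-DD' -> {currency: units of currency per 1 USD}.
    This layout is an assumption made for this sketch.
    """
    # ISO-formatted dates sort chronologically as plain strings.
    dates = sorted(rates_by_date)

    def to_usd(amount, currency, date_str):
        # USD amounts and zero amounts need no conversion.
        if currency == "USD" or amount == 0.0:
            return amount
        # Position of the latest quoted date that is <= date_str.
        idx = bisect_right(dates, date_str) - 1
        if idx < 0:
            raise KeyError(f"no rate on or before {date_str}")
        return amount / rates_by_date[dates[idx]][currency]

    return to_usd

# Made-up rates for illustration only, not real market data.
rates = {
    "2014-12-12": {"SEK": 8.0},
    "2014-12-15": {"SEK": 10.0},
}
to_usd = make_usd_converter(rates)

weekend = to_usd(80.0, "SEK", "2014-12-14")  # no quote that day, uses 2014-12-12
exact = to_usd(80.0, "SEK", "2014-12-15")    # quote available on the day itself
print(weekend, exact)
```

Unlike the original function, this sketch does not fall back to an earlier date when the currency itself is missing on the matched day; it raises `KeyError` in that case. Sorting the dates once and using a binary search also avoids re-parsing every date on each fallback.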

More EDA on the final dataframe

Let's drop these rows: they have 0 backers and previously had neither a country nor a usd pledged value, which looks like a mistake in how the data was collected.

I'll leave it as it is, but it's interesting to see that some duplicates seem genuine, others seem to be the same project revamped or relaunched, and others are another rendition of the same project (a theater play and a video, for instance).

It would be interesting to know more about the motives and mindset of people creating these projects again (a renewed need for funds?). Are there also cases where past successful projects were rebooted (hoaxes?)?

Overall, these rows can still be included in our model, as we want to predict the success or failure of a campaign regardless.
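Duplicated project names like the ones discussed above can be listed side by side with `duplicated(keep=False)`. This is a sketch on toy data; the column names `ID`, `name` and `main_category` are assumptions.

```python
import pandas as pd

# Toy data; column names are assumptions.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "name": ["Hamlet", "Hamlet", "Robot kit", "Comic", "Robot kit"],
    "main_category": ["Theater", "Film & Video", "Technology",
                      "Comics", "Technology"],
})

# keep=False flags every member of a duplicate group, not just the repeats,
# so relaunches and alternative renditions can be compared row by row.
dupes = (
    toy[toy.duplicated(subset="name", keep=False)]
    .sort_values(["name", "ID"])
)
print(dupes)
```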

Categories and main_categories

Successful projects with the most backers

Exploding Kittens is in first place! I really enjoyed playing this game and know that its campaign was indeed pretty popular.
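A ranking like this one can be produced with `nlargest` on the successful projects. The sketch below uses toy data with assumed column names (`name`, `state`, `backers`); the same pattern would apply to goals or pledges by changing the ranking column.

```python
import pandas as pd

# Toy data; column names are assumptions.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "state": ["successful", "failed", "successful", "successful"],
    "backers": [120, 9000, 4500, 30],
})

# Keep successful projects only, then take the top 2 by number of backers.
top = toy[toy["state"] == "successful"].nlargest(2, "backers")
print(top[["name", "backers"]])
```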

Successful projects with the highest goals

Successful projects with the highest pledges

These often seem to be projects with BtoC rewards, in the product design and games categories. Such projects tend to have high goals and need a lot of backers for a successful outcome.

Distribution of goals and pledges

We take the log to better see the distributions as we have outliers in both cases.
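A minimal sketch of the transform, assuming a `goal` column: `np.log1p` computes log(1 + x), which keeps zero amounts (for example projects with nothing pledged) defined. Whether the original plots used `log1p` or a plain log is not stated, so this choice is an assumption.

```python
import numpy as np
import pandas as pd

# Toy data with one very large outlier; the column name is an assumption.
toy = pd.DataFrame({"goal": [0.0, 99.0, 999.0, 1_000_000.0]})

# log(1 + x) compresses the long right tail and is defined at x = 0.
toy["log_goal"] = np.log1p(toy["goal"])

# The transform is reversible with expm1 when original units are needed.
recovered = np.expm1(toy["log_goal"])
print(toy)
```

The transformed column can then be passed to any histogram function in place of the raw amounts.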

Based on the above histogram, failed projects seem to have higher values (that is, higher goals).